Introduce what motivates your Data Analysis (DA)
The world population recently passed 8 billion. According to the World Food Programme, around 828 million people go to bed hungry every night, and more than 49 million people in 49 countries are facing famine. Our team is interested in exploring the relationships among food supply, food waste, GDP, population density, and land available for agriculture across countries. We also want to explore opportunities to mitigate food waste and to estimate world food availability for the current population density.
As such, we aim to examine food security on a worldwide scale, using several macro indicators to better understand the factors that contribute to food waste. The main variables our model is built on are Food Production, GDP, Agriculture as Percentage of GDP, Land Used for Agriculture, Population, and Regions/Continents. One primary focus of our analysis is how food waste differs by region and which factors drive the disparity. We have therefore selected variables that can, to an extent, provide insight into the characteristics of a region.
Some of the hypotheses driving our analysis are listed below. Because GDP is a relatively effective signal of a country’s development, the first hypothesis uses it as a proxy for development. (1) “Developed regions waste a greater percentage of food.” (2) “The greater the quantity of food produced, the greater the percentage of food wasted.” (3) “The greater the share of agriculture in GDP, the greater the amount of food wasted.”
While our analysis is quantitatively-driven, …
We started our analysis by plotting every country on an easy-to-read heatmap. Right away, one can see that the United States is a major contributor to worldwide food waste, so we may want to focus our attention on this country. The US also has a high GDP, so other high-GDP countries may follow suit. To draw the map, we had to rename some countries in our data sets so that they matched the names used by the maps library; for example, United States of America becomes USA. Some countries have no data, which we explain in our flaws section.
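The renaming step can be sketched with a named lookup vector in base R. This is only an illustration, not our original code: the data frame, its values, and the "Russian Federation" entry are assumptions added for the example; only the United States of America to USA mapping comes from the text.

```r
# Hypothetical loss table; the countries and values are made up for illustration.
loss <- data.frame(
  country = c("United States of America", "Mexico", "Russian Federation"),
  loss    = c(120, 60, 45),
  stringsAsFactors = FALSE
)

# Lookup from data-set spelling to map spelling. "USA" comes from the text;
# the "Russia" entry is an assumed example of the same kind of mismatch.
name_fix <- c(
  "United States of America" = "USA",
  "Russian Federation"       = "Russia"
)

# Replace only the names found in the lookup and leave the rest unchanged.
needs_fix <- loss$country %in% names(name_fix)
loss$region <- loss$country
loss$region[needs_fix] <- unname(name_fix[loss$country[needs_fix]])

print(loss)
```

Keeping the original `country` column next to the new `region` column makes it easy to check which names were changed before joining the data to the map.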
To date, the US has the most instances of “over 40% food loss per commodity per year”, around twice as many as Mexico. It is striking that some commodities lose over 40% of their amount during the production and retail process. We want to look deeper into which commodities incur the most loss in the US and whether there is a reason behind it.
The following graph indicates that in the US, pineapple juice, orange juice, and grapefruit juice have been wasted the most over the past decades. One thing the most-wasted commodities have in common is that they are all juices. We may want to explore where in the production and retail process most of this juice waste occurs.
The following graph explores total food waste per year in the United States. In 2008, the US incurred its highest food waste of the past five decades. A possible explanation for this severe year is that during the 2008 recession, a lot of food was wasted because many retailers and food processors went out of business. We would need more data to back up this hypothesis.
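A yearly total of this kind can be computed by summing loss within each year and then picking the largest total. The sketch below uses base R and a made-up table; the column names `year` and `loss` and all values are assumptions for illustration.

```r
# Hypothetical US loss records; every number here is made up.
us_loss <- data.frame(
  year = c(2007, 2007, 2008, 2008, 2009, 2009),
  loss = c(10, 12, 20, 25, 11, 9)
)

# Sum the loss within each year.
yearly <- aggregate(loss ~ year, data = us_loss, FUN = sum)

# Pick the row with the largest yearly total.
peak <- yearly[which.max(yearly$loss), ]

print(yearly)
print(peak)
```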
The following data further explore the most wasted foods in the US. Besides juices, we observe that canned mushrooms, tomatoes, and spinach are also among the most wasted foods in the US.
## # A tibble: 85 × 2
## commodity sum_loss_per_year
## <chr> <dbl>
## 1 Pineapple juice 2012.
## 2 Canned mushrooms 1705
## 3 Orange juice 1480.
## 4 Grapefruit juice 1460.
## 5 Apple juice 1307.
## 6 Green garlic 935.
## 7 Grape juice 933.
## 8 Tomatoes 756.
## 9 Spinach 618.
## 10 Okra 592.
## # … with 75 more rows
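A ranking like the one above can be produced by summing loss per commodity and sorting in descending order. This is a minimal base-R sketch with made-up values, not our original pipeline; only the column names `commodity` and `sum_loss_per_year` are taken from the output.

```r
# Hypothetical per-commodity, per-year loss values; all numbers are made up.
us_loss <- data.frame(
  commodity = c("Pineapple juice", "Tomatoes", "Spinach",
                "Pineapple juice", "Tomatoes", "Spinach"),
  year      = c(2000, 2000, 2000, 2001, 2001, 2001),
  loss      = c(50, 20, 15, 45, 22, 18),
  stringsAsFactors = FALSE
)

# Sum the yearly loss values for each commodity.
totals <- aggregate(loss ~ commodity, data = us_loss, FUN = sum)
names(totals)[names(totals) == "loss"] <- "sum_loss_per_year"

# Sort so the most-wasted commodity comes first.
totals <- totals[order(totals$sum_loss_per_year, decreasing = TRUE), ]

print(head(totals, 10))
```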
Our whole analysis revolves around food waste in different countries over time. We can therefore look not only at food waste in a single year but also at food waste over a span of years. Our questions build on this idea: (1) Has food waste decreased or increased over time? (2) Is there a relationship between a country’s GDP and its food waste? (3) Are there specific countries that are outliers, wasting much more or much less food than everyone else? (4) What is the projected food waste for a specific region in 2023?
Depth of the DA
Answering these questions may seem simple at first glance, but there are a great many countries to compare. To compare food loss across all of them within a given timeframe, we wanted to display the data in a simple, user-friendly manner.
For this, we plotted every country in a heatmap. From the heatmap, we started homing in on the US as one of the biggest food waste offenders, which led us to the bar graphs right below it, where we see more specific examples of just how much food the US wastes compared to other countries. Then we wondered how the worldwide total loss has changed over time. Surprisingly, for the most part, it seems to be on a downward trend (shame on you, United States).
As we delved deeper into worldwide food waste and production, we started to question whether GDP and population were correlated with them. This led to our regression models, whose results appear under Modeling and Inference.
Modeling and Inference
suppressPackageStartupMessages(library(olsrr))
suppressPackageStartupMessages(library(kableExtra))
suppressPackageStartupMessages(library(corrplot))
##
## Call:
## lm(formula = Loss ~ Production, data = AggData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -585087131 -115732941 -28574344 142025080 663246104
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.700e+08 8.753e+07 9.939 1.28e-13 ***
## Production 1.818e-02 5.559e-03 3.270 0.00191 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 208500000 on 52 degrees of freedom
## Multiple R-squared: 0.1706, Adjusted R-squared: 0.1546
## F-statistic: 10.7 on 1 and 52 DF, p-value: 0.00191
## Loss Production
## Loss 1.0000000 0.4130241
## Production 0.4130241 1.0000000
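The correlation matrix above reports r = 0.4130241 between Loss and Production, and its square, about 0.1706, is the Multiple R-squared of the one-predictor model. This identity holds for any simple linear regression. The sketch below checks it on simulated data; the column names follow the output, but the values, the seed, and the coefficients used to simulate Loss are assumptions, so the numbers will not match ours.

```r
# Synthetic stand-in for AggData; all values are simulated, not the real data.
set.seed(1)
n <- 54
AggData <- data.frame(Production = runif(n, 1e10, 2e10))
AggData$Loss <- 8.7e8 + 0.018 * AggData$Production + rnorm(n, sd = 2e8)

# One-predictor model with the same formula as in the output above.
fit <- lm(Loss ~ Production, data = AggData)

# Pearson correlation between the response and the predictor.
r <- cor(AggData$Loss, AggData$Production)

# In simple linear regression, R-squared equals the squared correlation.
r_squared <- summary(fit)$r.squared
print(c(r = r, r_squared = r_squared, r_times_r = r^2))
```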
##
## Call:
## lm(formula = Loss ~ Year + GDP + AgriGDP + AgriLand + Population +
## Production, data = AggData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -401785287 -99038421 -10308608 112454861 588290004
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.277e+11 2.190e+11 -2.866 0.00625 **
## Year 3.224e+08 1.127e+08 2.860 0.00635 **
## GDP 6.403e-06 9.913e-06 0.646 0.52157
## AgriGDP -5.089e+07 7.730e+07 -0.658 0.51358
## AgriLand 3.005e+08 1.639e+08 1.834 0.07316 .
## Population -5.027e+00 1.449e+00 -3.470 0.00114 **
## Production 2.169e-01 1.012e-01 2.143 0.03745 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 182800000 on 46 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.4239, Adjusted R-squared: 0.3488
## F-statistic: 5.642 on 6 and 46 DF, p-value: 0.0001869
| Index | n | Predictors | R-Square | Adj R-Square |
|---|---|---|---|---|
| 1 | 1 | AgriGDP | 0.2347192 | 0.2197137 |
| 2 | 1 | AgriLand | 0.2133760 | 0.1982486 |
| 3 | 1 | Year | 0.1939272 | 0.1784258 |
| 4 | 1 | Population | 0.1888800 | 0.1732816 |
| 5 | 1 | Production | 0.1705889 | 0.1546387 |
| 6 | 1 | GDP | 0.1439178 | 0.1274546 |
| 7 | 2 | Year Population | 0.2951072 | 0.2674643 |
| 8 | 2 | GDP AgriLand | 0.2525274 | 0.2232147 |
| 9 | 2 | AgriLand Production | 0.2519245 | 0.2225882 |
| 10 | 2 | Year AgriLand | 0.2475960 | 0.2180899 |
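The table above has the shape of an all-possible-subsets summary: one row per combination of predictors, ordered by subset size and then by R-Square. The sketch below reproduces that idea in base R as a stand-in for a package routine. It uses simulated data and only three of the six predictors, and all values are assumptions for illustration.

```r
# Synthetic stand-in for AggData; the values are simulated, not the real data.
set.seed(42)
n <- 54
AggData <- data.frame(
  Year     = seq_len(n) + 1965,   # arbitrary consecutive years
  AgriGDP  = runif(n, 3, 10),
  AgriLand = runif(n, 36, 39)
)
AggData$Loss <- 1.4e9 - 5e7 * AggData$AgriGDP + rnorm(n, sd = 2e8)

predictors <- c("Year", "AgriGDP", "AgriLand")

# Enumerate every non-empty subset of the predictors.
subsets <- unlist(
  lapply(seq_along(predictors),
         function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit one linear model per subset and record its size and fit statistics.
results <- do.call(rbind, lapply(subsets, function(vars) {
  s <- summary(lm(reformulate(vars, response = "Loss"), data = AggData))
  data.frame(
    n            = length(vars),
    Predictors   = paste(vars, collapse = " "),
    R.Square     = s$r.squared,
    Adj.R.Square = s$adj.r.squared
  )
}))

# Order like the table above: by subset size, then by descending R-squared.
results <- results[order(results$n, -results$R.Square), ]
rownames(results) <- NULL
print(results)
```

With three predictors there are 7 candidate models; with the six predictors in our full model there would be 63, which is why the table above shows only the first rows.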
##
## Call:
## lm(formula = Loss ~ Year + AgriGDP + AgriLand, data = AggData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -469534446 -122527744 -10308033 109867013 729052731
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.201e+09 1.740e+10 -0.069 0.945
## Year 8.247e+05 6.025e+06 0.137 0.892
## AgriGDP -3.736e+07 7.739e+07 -0.483 0.631
## AgriLand 2.481e+07 1.591e+08 0.156 0.877
##
## Residual standard error: 204100000 on 49 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.2351, Adjusted R-squared: 0.1883
## F-statistic: 5.02 on 3 and 49 DF, p-value: 0.004101
##
## Selection Summary
## ----------------------------------------------------------------------------------
## Variable Adj.
## Step Entered R-Square R-Square C(p) AIC RMSE
## ----------------------------------------------------------------------------------
## 1 AgriGDP 0.2347 0.2197 12.1090 2180.4885 200101907.9473
## ----------------------------------------------------------------------------------
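The selection summary above reports a forward selection that stopped after entering AgriGDP. As a rough analogue, and not the same procedure, base R's `step()` performs forward selection by AIC instead of p-values. The sketch below runs it on simulated data in which Loss depends only on AgriGDP; all values are assumptions for illustration.

```r
# Synthetic stand-in for AggData; Loss is simulated to depend on AgriGDP only.
set.seed(7)
n <- 53
AggData <- data.frame(
  Year     = seq_len(n) + 1965,   # arbitrary consecutive years
  AgriGDP  = runif(n, 3, 10),
  AgriLand = runif(n, 36, 39)
)
AggData$Loss <- 1.4e9 - 5e7 * AggData$AgriGDP + rnorm(n, sd = 5e7)

# Start from the intercept-only model.
null_fit <- lm(Loss ~ 1, data = AggData)

# Add predictors one at a time while AIC keeps improving.
fwd <- step(
  null_fit,
  scope     = list(lower = ~ 1, upper = ~ Year + AgriGDP + AgriLand),
  direction = "forward",
  trace     = 0
)

# Predictors kept by the AIC-based forward search.
selected <- attr(terms(fwd), "term.labels")
print(selected)
```

Because AIC and p-value thresholds are different stopping rules, the two approaches can select different models on the same data.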
##
## Call:
## lm(formula = Loss ~ AgriGDP, data = AggData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -462030474 -121393937 -9332016 107418209 728487574
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1415466065 73646340 19.220 < 2e-16 ***
## AgriGDP -49240206 12450043 -3.955 0.000237 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 200100000 on 51 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.2347, Adjusted R-squared: 0.2197
## F-statistic: 15.64 on 1 and 51 DF, p-value: 0.0002369
## Loss AgriGDP
## Loss 1.0000000 -0.4844783
## AgriGDP -0.4844783 1.0000000
Explain the flaws and limitations of your analysis
One assumption that we initially made was that our food waste dataset held data for every country and every year. We were first proven wrong when we realized that some countries ceased to exist or came into existence during the period recorded. In addition, some countries are missing data for some years. This is because the FAO (Food and Agriculture Organization) originally collected the data largely by going through old documents that each country had provided, which does not work when a country does not prioritize collecting this information.
The second flaw (no consistent year-to-year data) undermined our belief that we could run a yearly analysis for every country. When we started producing graphs and a heat map for certain years, we got some “blank” spaces, and we realized that this flaw forced us to work across spans of years instead. This was not the case for every country; places such as the United States are usually good about collecting data. These blank spaces could also affect our overall world data, because during periods of instability some countries may not collect data, and so they are left out of our world analysis. As for countries appearing and disappearing, we do not think it affects our results much, because we focused on the world as a whole. We feel that working with the world as a whole minimizes the small data issues that individual countries may have.
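One way to quantify these blank spaces is to count, for each country, how many years in the overall range have no record. The sketch below does this in base R on a made-up table; the country labels, years, and column names are assumptions for illustration.

```r
# Hypothetical country-year records; a missing row means no data was reported.
records <- data.frame(
  country = c("A", "A", "A", "B", "B", "C"),
  year    = c(2000, 2001, 2002, 2000, 2002, 2001),
  stringsAsFactors = FALSE
)

# Every year that appears anywhere in the data.
all_years <- sort(unique(records$year))

# Cross-tabulate countries against the full set of years.
coverage <- table(records$country, factor(records$year, levels = all_years))

# Count the years with no record for each country.
missing_years <- rowSums(coverage == 0)
print(missing_years)
```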
Clarity Figures
We believe that our graphs are easy to read at a glance. While making them, we wanted to design them so that they could serve as a first iteration of something usable in the “interactive” part of this project. For example, the heatmap we made at the beginning quickly helped us understand which country was a “problem” when it came to food waste.